Introduction¶

Our primary goals in this research and analysis are to boost Regork's sales and improve operational effectiveness. We have selected three problem statements: identifying the optimal package sizes to stock in our business, determining the best way to issue coupons to consumers in order to maximize sales, and attaining the maximum coupon redemption rate for more profitable marketing.

We addressed these problem statements by analysing the existing CompleteJourney datasets relating to transactions, products, and demographics. We conducted exploratory data analysis on the Transactions, Demographics, and Products tables to answer the problem statements and derive meaningful insights from the data.

Business Problem 1¶

Is the retail discount distribution across age groups ideal? Are discounts yielding higher sales?

To answer this, after cleaning the data we grouped the ages into four groups (19-34, 35-44, 45-54, and 55+) and formed a new data frame called "demo_tran". We then filtered the sales data on the retail discount and plotted the graphs needed to generate insights addressing the business problem.

We found the following insights after performing the analysis on the graphs:

  • Revenue generated by issuing discounts to the 45-54 age group was higher than for the other age groups.
  • Discounts above 5 dollars have produced little additional sales, so we can scale them down.
  • Compared to the other age categories, the 55 and older age group received the largest percentage of discounts.
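The "above 5 dollars" insight relies on comparing discount tiers. A minimal sketch of how discounts could be bucketed into tiers with `pd.cut` (the tier edges and toy values here are our assumptions, not from the analysis):

```python
import pandas as pd

# Hypothetical sketch: bucket per-line retail discounts into tiers
# so sales per discount tier can be compared. Edges are assumed.
discounts = pd.Series([0.25, 0.80, 1.50, 4.99, 5.50, 7.25])
tiers = pd.cut(
    discounts,
    bins=[0, 1, 5, float("inf")],
    labels=["under $1", "$1-$5", "over $5"],
)
print(tiers.value_counts().sort_index())
```

Grouping `sales_value` by such a tier column would make the "scale down discounts over $5" claim directly measurable.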
In [1]:
from completejourney_py import get_data
import pandas as pd
import numpy as np
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import plotly.express as px
import matplotlib.pyplot as plt
import seaborn as sns
import datetime as dt
import calendar
import warnings
warnings.filterwarnings('ignore')
In [2]:
# Load the CompleteJourney tables once and unpack them
data = get_data()
transactions = data['transactions']
products = data['products']
promotions = data['promotions']
demographics = data['demographics']
campaigns = data['campaigns']
coupons = data['coupons']
campaign_descriptions = data['campaign_descriptions']
coupon_redemptions = data['coupon_redemptions']
In [3]:
demographics.head()
Out[3]:
household_id age income home_ownership marital_status household_size household_comp kids_count
0 1 65+ 35-49K Homeowner Married 2 2 Adults No Kids 0
1 1001 45-54 50-74K Homeowner Unmarried 1 1 Adult No Kids 0
2 1003 35-44 25-34K None Unmarried 1 1 Adult No Kids 0
3 1004 25-34 15-24K None Unmarried 1 1 Adult No Kids 0
4 101 45-54 Under 15K Homeowner Married 4 2 Adults Kids 2
In [4]:
demo_tran = demographics.merge(transactions, on = ['household_id'], how = 'inner')
demo_tran = demo_tran.dropna()  # drop rows with any missing values
demo_tran.head()
Out[4]:
household_id age income home_ownership marital_status household_size household_comp kids_count store_id basket_id product_id quantity sales_value retail_disc coupon_disc coupon_match_disc week transaction_timestamp
0 1 65+ 35-49K Homeowner Married 2 2 Adults No Kids 0 436 31317046240 823721 1 2.99 0.0 0.0 0.0 2 2017-01-07 18:55:24
1 1 65+ 35-49K Homeowner Married 2 2 Adults No Kids 0 436 31317046240 832990 2 5.98 0.0 0.0 0.0 2 2017-01-07 18:55:24
2 1 65+ 35-49K Homeowner Married 2 2 Adults No Kids 0 436 31317046240 854920 1 1.49 0.0 0.0 0.0 2 2017-01-07 18:55:24
3 1 65+ 35-49K Homeowner Married 2 2 Adults No Kids 0 436 31317046240 856942 1 2.99 0.0 0.0 0.0 2 2017-01-07 18:55:24
4 1 65+ 35-49K Homeowner Married 2 2 Adults No Kids 0 436 31317046240 868401 1 0.59 0.0 0.0 0.0 2 2017-01-07 18:55:24
In [5]:
demo_tran['age'].value_counts()
Out[5]:
45-54    189439
35-44    150652
25-34     77813
65+       46000
55-64     40091
19-24     14978
Name: age, dtype: int64
In [6]:
conditions = [
    ((demo_tran['age'] == "19-24") | (demo_tran['age'] == "25-34")),
    ((demo_tran['age'] == "35-44")),
    ((demo_tran['age'] == "45-54")),
    ((demo_tran['age'] == "55-64") | (demo_tran['age'] == "65+"))
    ]

# create a list of the values we want to assign for each condition
values = ['19-34', '35-44', '45-54', '55+']

demo_tran['age_group'] = np.select(conditions, values)
In [7]:
demo_tran['age_group'].value_counts()
Out[7]:
45-54    189439
35-44    150652
19-34     92791
55+       86091
Name: age_group, dtype: int64
In [8]:
demo_tran1 = demo_tran.copy()
demo_tran1 = demo_tran1[(demo_tran1['retail_disc'] > 0) & (demo_tran1['retail_disc'] < 1) ].groupby('age_group')\
                            .aggregate({'sales_value':'sum', 'retail_disc':'sum'})

demo_tran1['percentage'] = demo_tran1['retail_disc']*100/demo_tran1['sales_value']
demo_tran1 = demo_tran1.reset_index()
demo_tran1.head()
Out[8]:
age_group sales_value retail_disc percentage
0 19-34 71387.40 14601.23 20.453511
1 35-44 121789.53 24400.26 20.034776
2 45-54 139258.63 29145.40 20.928972
3 55+ 60278.84 12775.25 21.193590
In [9]:
# Create figure with secondary y-axis
fig = make_subplots(specs=[[{"secondary_y": True}]])

# Add traces
fig.add_trace(
    go.Scatter(x=demo_tran1.age_group, y=demo_tran1.percentage, name="Percent discount used"),
    secondary_y=True,
)

fig.add_trace(
    go.Bar(x=demo_tran1['age_group'], y=demo_tran1['sales_value'], name = 'Total Sales'),
    secondary_y=False,
)

fig.add_trace(
    go.Bar(x=demo_tran1['age_group'], y=demo_tran1['retail_disc'], name = 'Total Discount'),
    secondary_y=False,
)

# Add figure title
fig.update_layout(
    title_text="Retail Discount (for less than $1) v/s Total Sales Revenue (by Age Group)"
)

fig.update_yaxes(range=[0,40], secondary_y=True)
# Set x-axis title
fig.update_xaxes(title_text="Age Group")

# Set y-axes titles
fig.update_yaxes(title_text="Total Sales Value", secondary_y=False)
fig.update_yaxes(title_text="Total percentage of discount used", secondary_y=True)

fig.show()

Business Problem 2¶

Which package sizes generate the highest sales for the top 5 grocery products? Understanding this tells us which package sizes to focus on when procuring from manufacturers.

To generate insights, we first cleaned the data and identified the various product categories within Grocery. We then analysed those products' sales along with their package sizes to determine which package size generates the most revenue.

Based on the graph analysis, we discovered the following insights:

  • Drinking water and soft drinks packaged in 12-ounce portions generate the best sales.
  • Soft drinks is the top product category by sales value.

Below is the code and plots for the same:

In [10]:
products_filter = products[products['department']=="GROCERY"]
product_transactions = products_filter.merge(transactions, on = ['product_id'], how = 'inner')
products_transactions_oz = product_transactions[product_transactions['package_size'].str.contains("OZ", na = False)]
products_transactions_lb = product_transactions[product_transactions['package_size'].str.contains("LB", na = False)]

products_transactions_oz_groupby = products_transactions_oz.groupby(['product_id','product_category','package_size'])\
                                            .aggregate({'sales_value':'sum'}).reset_index()\
                                            .sort_values('sales_value', ascending=False).head(10)

unidff = products_transactions_oz_groupby.groupby('product_category').first().reset_index().sort_values('sales_value', ascending=False)

unidff['package_size'] = unidff.package_size.str.extract(r'(\d+)', expand=False)
unidff["package_size"] = pd.to_numeric(unidff["package_size"])
unidff = unidff.rename(columns = {'package_size':'package_size_in_ounces'})
# -----------------------------
products_transactions_lb_groupby = products_transactions_lb.groupby(['product_id','product_category','package_size'])\
                                            .aggregate({'sales_value':'sum'}).reset_index()\
                                            .sort_values('sales_value', ascending=False).head(7)

unidff_lb = products_transactions_lb_groupby.groupby('product_category').first().reset_index().sort_values('sales_value', ascending=False)

unidff_lb['package_size'] = unidff_lb.package_size.str.extract(r'(\d+)', expand=False)
unidff_lb["package_size"] = pd.to_numeric(unidff_lb["package_size"])
unidff_lb = unidff_lb.rename(columns = {'package_size':'package_size_in_pounds'})
In [11]:
labels = unidff.product_category
max_sales = unidff.sales_value
package_size_in_ounces = unidff.package_size_in_ounces

x = np.arange(len(labels))  # the label locations
width = 0.35  # the width of the bars

fig, ax = plt.subplots()
ax2 = ax.twinx()
rects1 = ax.bar(x - width/2, max_sales, width, label='max_sales', color = 'purple')
rects2 = ax2.bar(x + width/2, package_size_in_ounces, width, label='package_size_in_ounces', color = 'yellow')

# Add some text for labels, title and custom x-axis tick labels, etc.
ax.set_ylabel('Max sales')
ax2.set_ylabel('Package Size in ounces')
ax.set_title('Max sales and package size in ounces')
ax.set_xticks(x, labels)
ax.legend()
ax2.legend()

fig.autofmt_xdate(rotation=45)

plt.show()
In [12]:
prod_chart = unidff.sort_values(by = ['sales_value'],
                    ascending = False).iloc[0:15].reset_index()


fig1 = go.Figure()
# Draw points
fig1.add_trace(go.Scatter(x = prod_chart["sales_value"], 
                          y = prod_chart["product_category"],
                          mode = 'markers',
                          marker_color ='darkblue',
                          marker_size  = 10))
# Draw lines
for i in range(0, len(prod_chart)):
               fig1.add_shape(type='line',
                              x0 = 0, y0 = i,
                              x1 = prod_chart["sales_value"][i],
                              y1 = i,
                              line=dict(color='crimson', width = 3))

fig1.layout.hovermode = False

fig1.update_layout(
    title_text="Top 5 product categories of Groceries having maximum sales<br><sup>This data has been collected over the year 2017</sup>"
)
fig1.show()

Business Problem 3¶

To analyse campaign performance based on coupon redemption rates, total sales generated by the products that were part of each campaign, the households targeted, and the ages and income ranges of the customers.

  • We utilized the completejourney dataset to identify trends and patterns in the data and generated insights to help the leadership make effective data-driven decisions.

  • Throughout the analysis, we identify best practices and patterns in the better-performing campaigns and suggest areas of improvement for the poorer-performing campaigns.

Insights from the Graphs

  • Type A campaigns perform better than the other campaign types.
  • Households with a family income of 50-74K are the most likely to redeem coupons, followed by households in the 35-49K income range.
  • The 45-54 age group has the highest number of campaign promotional redemption users, followed by the 35-44 age group.
In [13]:
campaign_descriptions.head()
Out[13]:
campaign_id campaign_type start_date end_date
0 1 Type B 2017-03-03 2017-04-09
1 2 Type B 2017-03-08 2017-04-09
2 3 Type C 2017-03-13 2017-05-08
3 4 Type B 2017-03-29 2017-04-30
4 5 Type B 2017-04-03 2017-05-07
In [14]:
coupon_redemptions.head()
Out[14]:
household_id coupon_upc campaign_id redemption_date
0 1029 51380041013 26 2017-01-01
1 1029 51380041313 26 2017-01-01
2 165 53377610033 26 2017-01-03
3 712 51380041013 26 2017-01-07
4 712 54300016033 26 2017-01-07
In [15]:
campaigns.head()
Out[15]:
campaign_id household_id
0 1 105
1 1 1238
2 1 1258
3 1 1483
4 1 2200
In [16]:
coupons.head()
Out[16]:
coupon_upc product_id campaign_id
0 10000085207 9676830 26
1 10000085207 9676943 26
2 10000085207 9676944 26
3 10000085207 9676947 26
4 10000085207 9677008 26
In [17]:
# Compute campaign products total sales value for different campaign types

camp_prod_tsv = coupons[['campaign_id', 'product_id']] \
    .merge(campaign_descriptions[['campaign_id', 'campaign_type']], on='campaign_id') \
    .merge(transactions[['product_id', 'transaction_timestamp', 'sales_value']], on='product_id') \
    .groupby('campaign_type') \
    .agg(total_sales=('sales_value', 'sum')) \
    .reset_index() \
    .sort_values('campaign_type')

camp_prod_tsv
Out[17]:
campaign_type total_sales
0 Type A 6817835.35
1 Type B 623471.96
2 Type C 242966.70
In [18]:
# Households Targeted by campaigns 

camp_hh_targeted = campaigns.groupby('campaign_id', as_index=False) \
    .agg(hh_targeted=('household_id', 'nunique')) \
    .sort_values('campaign_id') \
    .merge(campaign_descriptions, on='campaign_id') \
    .loc[:, ['campaign_type', 'campaign_id', 'hh_targeted']]


camp_hh_targeted_type = camp_hh_targeted.groupby('campaign_type', as_index=False) \
    .agg(total_hh_targeted =('hh_targeted', 'sum')) 

camp_hh_targeted_type
Out[18]:
campaign_type total_hh_targeted
0 Type A 3647
1 Type B 2368
2 Type C 574
In [19]:
# Coupons redemption by households

camp_hh_redeemed = campaigns[['campaign_id']] \
    .drop_duplicates() \
    .merge(coupon_redemptions[['campaign_id', 'household_id']].drop_duplicates(), on='campaign_id', how='left') \
    .groupby('campaign_id', as_index=False) \
    .agg(hh_redeemed=('household_id', 'nunique')) \
    .sort_values('campaign_id') \
    .merge(campaign_descriptions[['campaign_id', 'campaign_type']], on='campaign_id') \
    .assign(campaign_id=lambda x: pd.to_numeric(x['campaign_id'])) \
    .loc[:, ['campaign_id', 'campaign_type', 'hh_redeemed']] 


camp_hh_redeemed_type = camp_hh_redeemed.groupby('campaign_type', as_index=False) \
    .agg(total_hh_redeemed =('hh_redeemed', 'sum')) 

camp_hh_redeemed_type
Out[19]:
campaign_type total_hh_redeemed
0 Type A 604
1 Type B 154
2 Type C 34
In [20]:
# Compute Campaign performance 

camp_perf_df = camp_hh_targeted.merge(camp_hh_redeemed, on='campaign_id', how='inner') \
    .sort_values('campaign_id') \
    .assign(redemption_rate=lambda x: (x['hh_redeemed']/x['hh_targeted'])*100) \
    .rename(columns={'campaign_type_x': 'campaign_type'}) \
    .loc[:, ['campaign_id', 'campaign_type', 'hh_targeted', 'hh_redeemed', 'redemption_rate']] \
    .sort_values(by='redemption_rate', ascending=False)

camp_perf_df
Out[20]:
campaign_id campaign_type hh_targeted hh_redeemed redemption_rate
17 18 Type A 1133 214 18.887908
12 13 Type A 1077 196 18.198700
2 3 Type C 12 2 16.666667
7 8 Type A 1076 158 14.684015
14 15 Type C 17 2 11.764706
18 19 Type B 130 15 11.538462
8 9 Type B 176 20 11.363636
25 26 Type B 118 12 10.169492
15 16 Type B 188 19 10.106383
26 27 Type A 361 36 9.972299
16 17 Type B 202 18 8.910891
9 10 Type B 123 10 8.130081
13 14 Type C 224 18 8.035714
0 1 Type B 13 1 7.692308
3 4 Type B 81 6 7.407407
11 12 Type B 170 11 6.470588
24 25 Type B 17 1 5.882353
21 22 Type B 276 15 5.434783
4 5 Type B 166 8 4.819277
19 20 Type C 244 11 4.508197
1 2 Type B 48 2 4.166667
20 21 Type B 65 2 3.076923
10 11 Type B 214 6 2.803738
6 7 Type B 198 5 2.525253
22 23 Type B 183 3 1.639344
5 6 Type C 65 1 1.538462
23 24 Type C 12 0 0.000000
In [21]:
# Compute max campaign ID
maximum_campaign_id = campaigns['campaign_id'].astype(int).max()
maximum_campaign_id
Out[21]:
27
In [22]:
# Campaign performance vs campaign type scatter plot

sns.set(rc={'figure.figsize':(11.7,8.27)})
plot= sns.catplot(data=camp_perf_df, x="campaign_id", y="redemption_rate", col = 'campaign_type', hue= 'campaign_type')
plot.set_axis_labels("campaign_id", "redemption_rate")
plot.fig.subplots_adjust(top=0.8)
plot.fig.suptitle("Overall Campaign Performance")

for ax in plot.axes.flat:
    labels = ax.get_xticklabels() # get x labels
    for i,l in enumerate(labels):
        if(i%4 != 0): labels[i] = '' # keep only every 4th label visible
    ax.set_xticklabels(labels) # set new labels
sns.despine(left=True)
plt.show()
In [23]:
# Coupon Redemption Of Household – By Income Range

demo_income = pd.merge(demographics, coupon_redemptions, on='household_id', how='left')
demo_income['redemption'] = demo_income['redemption_date'].apply(lambda x: 'Never Used a promotion' if pd.isna(x) else 'Used a Promotion')
demo_income = demo_income[['age', 'income', 'redemption']]
demo_income = demo_income.groupby(['income', 'redemption']).size().reset_index(name='hh_count')

demo_income
Out[23]:
income redemption hh_count
0 100-124K Never Used a promotion 20
1 100-124K Used a Promotion 46
2 125-149K Never Used a promotion 19
3 125-149K Used a Promotion 98
4 15-24K Never Used a promotion 57
5 15-24K Used a Promotion 75
6 150-174K Never Used a promotion 13
7 150-174K Used a Promotion 115
8 175-199K Never Used a promotion 6
9 175-199K Used a Promotion 21
10 200-249K Never Used a promotion 4
11 200-249K Used a Promotion 4
12 25-34K Never Used a promotion 50
13 25-34K Used a Promotion 129
14 250K+ Never Used a promotion 9
15 250K+ Used a Promotion 3
16 35-49K Never Used a promotion 116
17 35-49K Used a Promotion 234
18 50-74K Never Used a promotion 105
19 50-74K Used a Promotion 593
20 75-99K Never Used a promotion 65
21 75-99K Used a Promotion 247
22 Under 15K Never Used a promotion 42
23 Under 15K Used a Promotion 104
In [26]:
# Plot Households Redeeming Promotions by Income Range

demo_income_plot = sns.barplot(x='hh_count', y='income', hue='redemption', data=demo_income, palette=['#E69F00', '#56B4E9'])
demo_income_plot.set(title='Households Redeeming Promotions by Income Range', xlabel='Number of Households', ylabel='Income Range')
demo_income_plot.legend(title='Coupon Redemption?')
demo_income_plot.figure.subplots_adjust(left=0.3)
demo_income_plot.figure.set_size_inches(8, 6)
demo_income_plot.set_axisbelow(True)
In [24]:
# Coupon Redemptions Of Household – By Age

demo_age = pd.merge(demographics, coupon_redemptions, on='household_id', how='left')
demo_age['redemption'] = demo_age['redemption_date'].apply(lambda x: 'Never Used a promotion' if pd.isna(x) else 'Used a Promotion')
demo_age = demo_age[['age', 'income', 'redemption']]
demo_age = demo_age.groupby(['age', 'redemption']).size().reset_index(name='hh_count')

demo_age
Out[24]:
age redemption hh_count
0 19-24 Never Used a promotion 37
1 19-24 Used a Promotion 40
2 25-34 Never Used a promotion 92
3 25-34 Used a Promotion 329
4 35-44 Never Used a promotion 123
5 35-44 Used a Promotion 412
6 45-54 Never Used a promotion 170
7 45-54 Used a Promotion 664
8 55-64 Never Used a promotion 36
9 55-64 Used a Promotion 123
10 65+ Never Used a promotion 48
11 65+ Used a Promotion 101
In [25]:
# Plot Households Redeeming Promotions by Age

plt.figure(figsize=(8,6))
sns.barplot(x='hh_count', y='age', hue='redemption', data=demo_age, orient='horizontal', alpha=1)
plt.xlabel('Number of Households')
plt.ylabel('Age Group')
plt.title('Number of Households Redeeming Promotions by Age')
plt.legend(title='Coupon Redemption?')
plt.show()

Summary

  • Target the 45-54 age group when giving discounts, since it yields higher sales.
  • Drinking water and soft drinks packaged in 12-ounce portions should be stocked more, as they generate the best sales.
  • Type A campaigns perform better than the other campaign types.
  • To attain the maximum redemption rate, we should run campaigns on households with a family income of 100K and above.
  • Young adults aged 19-24 and older citizens above 55 have been targeted least by campaigns and have the fewest promotional benefit users. We can launch personalized campaigns for these age groups to boost sales.

Limitations

  • Coupon-related sales information is not reflected in the transactions dataset, so the analysis was made on the total sales of the products targeted by the campaigns.
  • We have transaction data only for 2017, whereas campaign data spans 2016-2018, so the analysis was limited to 2017. There is no granular information about the characteristics and other attributes of campaigns and coupons.
  • There were too many package-size scales, so we limited our analysis to pounds and ounces.
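One way to ease the package-size limitation would be to normalize pounds and ounces onto a single scale. A minimal sketch, assuming 1 lb = 16 oz and the "12 OZ" / "2 LB" string format seen in the data (the toy values here are illustrative):

```python
import pandas as pd

# Hypothetical sketch: normalize package sizes quoted in LB and OZ
# onto a single ounces scale so they can be compared directly.
sizes = pd.DataFrame({"package_size": ["12 OZ", "2 LB", "6.5 OZ", "1 LB"]})
num = sizes["package_size"].str.extract(r"([\d.]+)", expand=False).astype(float)
is_lb = sizes["package_size"].str.contains("LB")
sizes["size_oz"] = num.where(~is_lb, num * 16)  # convert LB rows to ounces
print(sizes)
```

With a unified `size_oz` column, the OZ and LB analyses in Business Problem 2 could be run as a single groupby instead of two parallel ones.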

Future Scope

  • In our analysis we compared discounts against age groups. Similar analyses can be done on other demographic attributes.
  • We could further analyse products versus discounts, which could generate more insights.
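The proposed product-versus-discount extension could start from a groupby like the sketch below. The data is toy data and the column names merely mirror the CompleteJourney tables; this is an assumed starting point, not a finished analysis:

```python
import pandas as pd

# Hypothetical sketch of a product-vs-discount analysis: total retail
# discount and sales value per product category, plus discount share.
df = pd.DataFrame({
    "product_category": ["SOFT DRINKS", "SOFT DRINKS", "WATER", "WATER"],
    "sales_value": [3.00, 2.50, 1.20, 1.80],
    "retail_disc": [0.50, 0.00, 0.30, 0.20],
})
summary = (
    df.groupby("product_category")
      .agg(total_sales=("sales_value", "sum"),
           total_disc=("retail_disc", "sum"))
      .assign(disc_pct=lambda d: 100 * d["total_disc"] / d["total_sales"])
      .reset_index()
)
print(summary)
```

Ranking categories by `disc_pct` against `total_sales` would show where discounts are doing the most (or least) work, in the same spirit as the age-group analysis in Business Problem 1.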